[00:01.470 --> 00:10.050]  Hello everyone and welcome to our next talk which is mine. To introduce myself, which is wrong,
[00:10.050 --> 00:16.490]  I did a postdoc in machine learning after getting my PhD in algebraic topology with some of the
[00:16.490 --> 00:22.450]  people who did some of the math behind the concepts in this talk and I've just extended
[00:22.450 --> 00:30.810]  it for this thing. So that's enough about me. Let me play my talk back for you.
[00:32.490 --> 00:34.530]  So, here we go.
[00:34.530 --> 00:39.830]  I welcome everyone to my talk on calculating concept drift very quickly with this library called
[00:39.830 --> 01:01.730]  Goko, which is based on... I hear no audio.
[01:05.790 --> 01:06.930]  Oopsie daisy.
[01:28.690 --> 01:31.430]  Um, there's no... do you hear audio from the talk?
[01:38.900 --> 01:41.260]  Okay, well that's the problem I'm trying to...
[02:02.030 --> 02:04.050]  Uh, sorry about this guys.
[02:22.810 --> 02:27.950]  Uh, sorry, I don't know what's going on.
[03:18.250 --> 03:23.590]  Um, I am... I'll apologize.
[03:34.220 --> 03:39.140]  Key takeaway for this talk is this is a novel way of calculating drift.
[03:39.980 --> 03:44.440]  Welcome everyone to my talk on calculating concept drift very quickly with this library
[03:44.440 --> 03:49.400]  called Goko, which is based on cover trees. My name is Sven Cattell. I am a senior data
[03:49.400 --> 03:53.940]  scientist at Elastic and my Twitter handle is @comathematician.
[03:54.480 --> 04:01.660]  So, the key outline... key takeaway for this talk is this is a novel way of calculating drift.
[04:01.660 --> 04:05.970]  It's done using a new technique using cover trees.
[04:06.800 --> 04:13.600]  And because it's so fast, it's log n time to actually calculate drift for a new sample,
[04:13.600 --> 04:19.460]  we can use it in new ways that concept drift couldn't have been applied before now.
[04:19.660 --> 04:22.560]  So, we're going to go over the objective, why we care about concept drift,
[04:22.560 --> 04:28.360]  mathematically how it's done, then a couple results, and then finishing up with next steps.
[04:28.860 --> 04:33.780]  So, why do we care about data set drifts? Well, our data drifts.
[04:35.300 --> 04:41.040]  So, I work with malware and there are new malware families coming out every month.
[04:41.040 --> 04:45.300]  There are new benign pieces, examples of software coming out every month.
[04:45.300 --> 04:49.580]  Every time that Adobe patches Photoshop, it changes the pattern slightly and it changes
[04:49.580 --> 04:54.980]  its location slightly in our data set. Every single time a new piece of malware comes up,
[04:54.980 --> 05:00.880]  it's a new... it changes the way our... the malware data is distributed slightly.
[05:01.500 --> 05:08.040]  Another problem... well, what those do, basically, which gets into the second problem,
[05:08.040 --> 05:14.720]  they lower the efficacy of our models that we deploy. So, our models expect data in a sort of
[05:14.720 --> 05:20.340]  the same sort of general location as our training data. And every time we deploy, we are kind of
[05:20.340 --> 05:27.640]  fixing our model's view of the world at a certain day, and then we're training it and we're testing
[05:27.640 --> 05:34.800]  it on essentially future data. So, it doesn't have a perspective for the... if it's trained in March,
[05:34.800 --> 05:40.800]  it doesn't have... it might not work very well at the end... data coming in at the end of April.
[05:41.800 --> 05:46.620]  So, one of the methods for dealing with this is you build a dashboard that basically tracks the
[05:46.620 --> 05:53.340]  error. And when the error gets too high, you discard the model and go on for a new one.
[05:53.340 --> 05:58.460]  But the problem is, I don't trust VirusTotal labels, and I don't think anyone really should trust
[05:59.400 --> 06:03.000]  VirusTotal labels. And that's the best way we have; because of the sheer volume of data,
[06:03.000 --> 06:10.480]  we have to kind of trust VirusTotal labels. So, and for other things, I wouldn't trust
[06:10.480 --> 06:16.660]  early labels on data because we have to move very quickly and sometimes with whitelisting and stuff.
[06:18.380 --> 06:24.520]  And, you know, well, and sometimes we only get like a label when our customer complains and we'd rather
[06:25.160 --> 06:29.920]  get ahead of that and figure out like, hey, where's the drift happening? What's going on?
[06:29.920 --> 06:34.900]  And then additionally, aside from the fact that things are just changing all the time,
[06:34.900 --> 06:39.060]  our models are also under attack. People are trying to bypass them by doing new and weird
[06:39.060 --> 06:46.580]  things to their data, like packing a spam filter and seeing if that works, or like posting a
[06:46.580 --> 06:51.100]  different pattern to see if that works, to see if they can get past like the Facebook spam filter.
[06:51.100 --> 06:56.020]  They're constantly innovating with a specific goal in mind of bypassing our models.
[06:56.540 --> 07:02.140]  This may not be an adversarial example per se from the literature, but it sure
[07:02.140 --> 07:08.500]  kind of acts like that sometimes. And additionally, we could be under attack via a poisoning system.
[07:09.420 --> 07:14.760]  Maybe if the detector is fast enough, we can get ahead of those things beforehand.
[07:15.640 --> 07:22.020]  So here's a trivial example. So we have these two data sets. One is our training data set. That's
[07:22.020 --> 07:29.500]  we've got 1, 1, 2, 1, 1. And so our training data set has 70% 1s, 20% 2s, and 10% 3s.
[07:29.900 --> 07:35.420]  And this is what we trained on. Realistically, there's no point in training a machine learning
[07:35.420 --> 07:43.140]  model on a sequence of numbers, but suppose we do. But then we deploy this model, and we get
[07:43.140 --> 07:48.960]  this other sequence of data. 2, 2, 1, 2, so on, so on. And when we actually deploy it, we see
[07:48.960 --> 07:54.420]  data coming over the wire. We have 30% 1s, 60% 2s, and 10% 3s. So that's a little bit different
[07:54.420 --> 08:01.340]  from our training set. So the real world is different from our training set. So in this case,
[08:01.340 --> 08:06.240]  this is a... we can model this distribution very easily because it's a discrete space.
[08:06.240 --> 08:12.640]  There are only three things that our data could be. There's only... it isn't continuous like a lot
[08:12.640 --> 08:17.600]  of the stuff we have to deal with in data science. So we can actually compute the distribution fairly
[08:17.600 --> 08:22.980]  easily. And this is a pretty good... we can feel confident in our estimate of the distribution.
[08:23.180 --> 08:26.760]  We have to do some Bayesian statistics to do this properly,
[08:28.000 --> 08:35.100]  which we're not getting into this talk. But this is basically how it works. And then once we have
[08:35.100 --> 08:40.460]  like a concept of our training distribution, and our real world distribution, or our test
[08:40.460 --> 08:48.720]  distribution, as I'll be calling it today, we can take the Kullback-Leibler divergence. So
[08:49.500 --> 08:55.930]  this is a sort of a measure of the distance between this distribution and this distribution.
[08:56.200 --> 09:03.760]  And it's very easy to calculate on this categorical discrete data because it's just
[09:03.760 --> 09:10.340]  going to be, well, 0.7 times log of 0.7 over 0.3. So that's the training probability for 1s over the
[09:11.160 --> 09:16.120]  test probability for 1s. Take the log and then multiply it by the training probability for 1s. And that gives you the first
[09:16.120 --> 09:21.140]  term. Second term is the same thing for 2s. And the third term is the same thing for 3s.
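As a minimal sketch of the three-term sum just described (not code from the talk), the discrete KL divergence for the 70/20/10 training mix and the 30/60/10 deployed mix can be written as below. The function name `kl_divergence` is illustrative. The talk does not say which log base it uses; base-10 logs happen to reproduce the quoted figure of about 0.16, while natural logs give about 0.37.

```python
import math

def kl_divergence(p, q, base=math.e):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:  # a category with zero probability under p contributes nothing
            total += p_i * math.log(p_i / q_i, base)
    return total

train = [0.7, 0.2, 0.1]  # share of 1s, 2s, 3s in the training data
test = [0.3, 0.6, 0.1]   # share of 1s, 2s, 3s seen after deployment

print(round(kl_divergence(train, test, base=10), 2))  # 0.16 (base-10 logs)
print(round(kl_divergence(train, test), 2))           # 0.37 (natural logs)
```

The third category has the same probability in both sets, so its term is zero and all of the divergence comes from the 1s and 2s.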
[09:21.240 --> 09:26.520]  And then we add those all up and you get the KL divergence between these two distributions is
[09:26.520 --> 09:34.060]  0.16. So that's all great. But the problem is our data looks like this. And I don't know whether
[09:34.060 --> 09:42.200]  this has any difference. Are my blue points sampled from the same distribution as my orange
[09:42.200 --> 09:48.020]  points? I don't know. It's really hard to tell by just looking at this picture.
[09:48.600 --> 09:53.840]  And now, in this case, they are. So my orange points are sampled from a two-dimensional
[09:54.340 --> 10:02.040]  Gaussian, two-dimensional normal distribution, where it's just, it's a nice sphere. And they're
[10:03.200 --> 10:08.180]  both, the covariance matrix is 1, 1. And the covariance matrix with blues is 1, 1. And it
[10:08.180 --> 10:12.560]  gives you this nice distribution. And it looks okay. Now, what I'm going to do is I'm going to
[10:12.560 --> 10:19.220]  drift the blues slightly over. And can you tell, is this correct? Well, you know, every single time
[10:19.220 --> 10:23.580]  that I put a blue, it's right next to some orange points. So this looks like it could be from the
[10:23.580 --> 10:28.500]  distribution. I can't really tell with my eyes. And one of the techniques we might do is like,
[10:28.500 --> 10:33.040]  oh, well, let me take the k-nearest neighbors and see if the distance from my k-nearest neighbors
[10:33.640 --> 10:39.140]  over time goes up, then I'm out of distribution. But if you tell here, that's not going to tell
[10:39.140 --> 10:46.000]  you much. If I go here, well, oh, I've got some outliers here now. So maybe that will work now.
[10:46.060 --> 10:50.200]  If I go here, well, this is really drifted. So the center of this distribution
[10:50.200 --> 10:53.960]  should be around here. And the center for this distribution is around here. So that's really
[10:53.960 --> 10:59.480]  drifted. But you've got some outliers over here. Now, what happens when there's like millions of
[10:59.480 --> 11:03.760]  orange points and there's only a few hundred blue points? Well, you might not be able to tell the
[11:03.760 --> 11:10.060]  difference with that in that case. So, and also like, there's all this, what I've described to
[11:10.060 --> 11:14.020]  you is kind of a feeling. And to actually get at the math of this thing, I have to like model the
[11:14.020 --> 11:19.300]  distributions that these came from. And that's quite a hard problem. And especially in high
[11:19.300 --> 11:25.320]  dimensions. So things get kind of complicated. So the solution to this that I came up with is to
[11:25.320 --> 11:31.540]  use a cover tree. Now, why do I want to use a cover tree? So the cover tree, basically,
[11:32.120 --> 11:36.420]  the key takeaway for cover trees is it's a k-nearest neighbors data structure. There are
[11:36.420 --> 11:46.020]  many like it. There's kd trees, there's b trees, there's kd trees, you know,
[11:46.020 --> 11:54.200]  there's k-means trees, a whole bunch. And, well, the cover tree has this wonderful thing where
[11:54.940 --> 12:02.060]  in 2016, Mauro Maggioni and Wenjing Liao proved that it can arbitrarily well approximate the
[12:02.060 --> 12:07.080]  underlying distribution of the data. There's some caveats to that statement, and mathematically,
[12:07.080 --> 12:11.300]  to state it properly would probably take a whole page. But basically,
[12:11.300 --> 12:16.580]  the whole concept of their proof is for a nice data set, and to know what a nice data set is,
[12:16.580 --> 12:21.600]  you need to know what a low dimensional manifold is. And you have to have enough data from that.
[12:22.000 --> 12:28.940]  But for a nice data set, which most of our data sets are pretty nice, sort of, kind of,
[12:29.120 --> 12:33.260]  a cover tree will arbitrarily well approximate the underlying distribution of the data.
[12:34.460 --> 12:39.160]  There's some caveats, and you can go read their paper if you want on exactly what that means.
[12:39.160 --> 12:44.140]  But for us, what this means is, if I use a cover tree to build a model of my data for
[12:44.140 --> 12:48.680]  k-nearest neighbors, I can take k-nearest neighbors, or I can just fiddle around with
[12:48.680 --> 12:52.500]  the properties of the cover tree to, like, get information about my data set.
[12:53.420 --> 12:57.440]  So, I can tell things like the local dimensionality, I can tell things like how
[12:57.440 --> 13:01.880]  things glue together, and I can tell, like, sort of, the shape of my data fairly well. I can tell
[13:01.880 --> 13:07.580]  clusters and things. I can infer what clusters are. Let's go build a very simple cover tree
[13:07.580 --> 13:13.060]  in two-dimensional data. So, here's a, you know, infinity sign, a bow tie, whatever you want to
[13:13.060 --> 13:17.900]  call it, and I start my cover tree by picking a point at random and then building a sphere.
[13:18.080 --> 13:23.360]  You know, in this case, it's a circle that covers everything. So, this is the start. And now,
[13:23.360 --> 13:27.440]  this doesn't help me that much. I haven't split up my data geometrically to, like,
[13:27.440 --> 13:33.420]  divide and conquer the k-nearest neighbors structure, because the naive way would take
[13:33.960 --> 13:38.880]  a linear time, big O of n, but we want to divide and conquer, so it takes log of n.
[13:38.880 --> 13:43.640]  So, here's what I do. I shrink my sphere down, add another one, and I cover it again.
[13:43.760 --> 13:49.380]  You can kind of tell that this object is longer than it is wide. It's kind of,
[13:49.380 --> 13:53.460]  if you really squint, it's kind of a one-dimensional thing in this direction.
[13:55.100 --> 13:59.960]  So, that's great. But now, if I go, okay, well, I'm going to split it up further, so really divide
[13:59.960 --> 14:04.920]  and conquer, and I, so I've got these two nodes, and they're both children of the previous node on
[14:04.920 --> 14:10.220]  the previous slide, and I'm going to build their children. So, here's their children.
[14:10.880 --> 14:16.760]  So, I divide and conquer their stuff, and I can kind of tell that each of them has
[14:16.760 --> 14:23.580]  four children, and if you squint a little bit, it's kind of square-shaped in this dimension.
[14:23.740 --> 14:29.000]  At the top scale, it was too blurry. I could only see a line here. At this scale,
[14:29.000 --> 14:34.900]  I can kind of see that it's got a nice square shape, and so there are four
[14:34.900 --> 14:40.120]  children, and the reason is that dimensionality is relevant to the number of
[14:40.120 --> 14:45.860]  children you have. And then I can go further, and you know, it's got some more, it's kind of split
[14:45.860 --> 14:50.680]  up nicely, and I've got like a one-dimensional object here again, because all these children
[14:50.680 --> 14:55.940]  are kind of one-dimensional-ish, and I can split up some more, and you can see that it's kind of
[14:55.940 --> 15:00.980]  conforming to your data nicely, and at every single step of the way, I can make some inferences about
[15:00.980 --> 15:06.380]  what the shape of my data is. They're very cheap, because it's basically counting.
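The shrink-and-cover construction walked through here can be sketched roughly as follows. This is a simplified greedy covering hierarchy written for illustration, not Goko's implementation and not a full cover tree (it assigns each point to the first sphere that covers it and skips the separation bookkeeping). The names `Node`, `build`, and `bow_tie`, the `min_radius` cutoff, and the use of the first point instead of a random one as the root are all assumptions made for this example.

```python
import math

class Node:
    def __init__(self, center, radius, points):
        self.center = center    # a data point used as the sphere's center
        self.radius = radius    # every point in `points` lies within this distance
        self.points = points
        self.children = []

def build(center, points, radius, min_radius=0.05):
    node = Node(center, radius, points)
    if len(points) <= 1 or radius <= min_radius:
        return node
    child_radius = radius / 2             # shrink the sphere at each level
    groups = [(center, [])]               # the parent's center is reused as the first child
    for p in points:
        for c, members in groups:
            if math.dist(p, c) <= child_radius:
                members.append(p)
                break
        else:
            groups.append((p, [p]))       # p is not covered yet, so it starts a new sphere
    node.children = [build(c, m, child_radius, min_radius) for c, m in groups if m]
    return node

def bow_tie(n=400):
    # Points on a figure-eight curve (lemniscate of Gerono).
    return [(math.cos(t), math.sin(2 * t) / 2)
            for t in (2 * math.pi * i / n for i in range(n))]

points = bow_tie()
root_center = points[0]                   # fixed instead of random, for repeatability
root_radius = max(math.dist(root_center, p) for p in points)
root = build(root_center, points, root_radius)

# Walk the tree level by level and report how many children the nodes have.
level = [root]
while level:
    fanout = [len(n.children) for n in level if n.children]
    if fanout:
        print(f"radius {level[0].radius:.3f}: {len(level)} nodes, "
              f"max children {max(fanout)}")
    level = [c for n in level for c in n.children]
```

The printed child counts are the cheap "it's basically counting" signal: how many children a node needs at a given radius hints at the local dimensionality of the data at that scale.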
[15:06.980 --> 15:13.740]  So how do we build a probability distribution? So on slide two, we built a very simple
[15:13.740 --> 15:21.980]  probability distribution, where I had a discrete space, and now I've got a tree, and essentially,
[15:21.980 --> 15:26.540]  when I have a tree, at every single node, I can either go left, or right, or down, or you know,
[15:26.540 --> 15:33.660]  in this case, my number of children I have is variable, but I will be at a parent node,
[15:33.660 --> 15:40.700]  and I have to go to one of its children. So I always have discrete choices, a discrete choice
[15:40.700 --> 15:47.460]  of where to go. So how does that apply? So here is a very, like, simple cover tree,
[15:47.460 --> 15:54.840]  and I've colored it by the total, like, the population of this node.
[15:55.900 --> 16:02.520]  So this color here is the relative population of this node. So about 66% of the data
[16:03.400 --> 16:10.960]  is underneath this node. About 33% of the data is underneath this node. And you can see that
[16:10.960 --> 16:17.920]  colors are kind of changing. I get a kind of a picture of how dense my data is, but that doesn't
[16:17.920 --> 16:22.680]  help me, because I need to make choices. It doesn't help me make the choices at all steps.
[16:22.800 --> 16:30.820]  So if I go here, this helps me make my choices. So here, I've got 66% of my data over here, and I've
[16:30.820 --> 16:38.840]  got 33% of my data over here. And then once I've taken... I have to take one of those two choices.
[16:38.840 --> 16:45.040]  If I take this choice, I'm now at this node, and I have two choices again. And the relative
[16:45.040 --> 16:51.340]  probability of taking this choice versus this one... Well, this is basically a 0.1 probability
[16:51.340 --> 16:58.440]  of taking this path, and this is a 0.9 probability of taking this path. And once I get to this node,
[16:58.440 --> 17:03.400]  I have a probability of one of going straight down, because I have no other choices.
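To make the per-node choices concrete, here is a small sketch using a hypothetical tree whose populations (660 and 330 under the root, then 66 and 594 under one child) were picked to match the 66%/33% and 0.1/0.9 figures mentioned here. The dictionary layout and the function names are illustrative, not Goko's API.

```python
# Hypothetical populations: how many training points sit under each child.
tree = {
    "root": {"A": 660, "B": 330},
    "A": {"A1": 66, "A2": 594},
    "B": {"B1": 330},
}

def child_probabilities(child_counts):
    """Turn the child populations of one node into a discrete distribution."""
    total = sum(child_counts.values())
    return {child: n / total for child, n in child_counts.items()}

def path_probability(tree, path):
    """Probability of following `path` from the root, one discrete choice per node."""
    prob = 1.0
    for parent, child in zip(path, path[1:]):
        prob *= child_probabilities(tree[parent])[child]
    return prob

print(child_probabilities(tree["A"]))                          # {'A1': 0.1, 'A2': 0.9}
print(child_probabilities(tree["B"]))                          # {'B1': 1.0}
print(round(path_probability(tree, ["root", "A", "A2"]), 3))   # 0.6
```

Each node only needs its children's counts, so the whole model is a collection of small categorical distributions like the one in the 1s, 2s and 3s example.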
[17:03.400 --> 17:09.200]  And every single point, I have a discrete distribution, like we had on page one, which
[17:09.200 --> 17:15.840]  tells me where I'm going. And because my tree is geometrically motivated, these choices,
[17:15.840 --> 17:22.680]  which were all discrete and easy to compute, and easy to count, and easy to keep track of,
[17:22.680 --> 17:27.260]  now I can compute the KL divergence, because it's trivial. It's all... it's just counting,
[17:27.260 --> 17:34.520]  taking logs, and summing things up. So this is my prior distribution. As I said,
[17:34.520 --> 17:39.500]  we're going to use some Bayesian statistics. And so here is my prior. How... this is built
[17:39.500 --> 17:46.900]  with my training set. How do I include my test set? So when I see a piece of data coming over
[17:46.900 --> 17:54.200]  the wire, I can make an inference of where it should belong in this tree. So if it was already...
[17:54.200 --> 17:58.400]  if it was in my training set and didn't get involved in building the tree,
[17:58.400 --> 18:03.460]  where would have it ended up? So I have a point here. So this is my new point that came over the
[18:03.460 --> 18:11.020]  wire. And I queried and said, oh well, this point is covered by this parent, and it's covered by
[18:11.020 --> 18:17.460]  that parent, and that parent, and all the way up. So the path for this point goes up this way.
[18:17.460 --> 18:21.980]  And now for each of those elements in those paths, I can increment my probabilistic distribution
[18:21.980 --> 18:26.860]  and get a posterior one that involves... that includes the new information that I got,
[18:26.860 --> 18:31.320]  that there's a new point over here. And now I can do that again with another point.
[18:31.320 --> 18:36.100]  Oh yep, cool. I have a posterior... this point's been added, and you can see that the distribution
[18:36.100 --> 18:41.140]  changes. And then I can add some more points, and I can add some more points. So these are
[18:41.140 --> 18:46.260]  all my posterior. And I can update my original thing by taking the path of each of the points
[18:46.260 --> 18:53.540]  and updating all the little discrete distributions that I have. So this enables me to
[18:54.100 --> 19:00.940]  sort of quickly make a posterior distribution out of my prior distribution, and everything
[19:00.940 --> 19:08.100]  is discrete. And then I can take the KL divergence between my posterior discrete distributions and
[19:08.100 --> 19:14.780]  my prior discrete distributions. So in this case, this one has a probability of 0.33,
[19:14.780 --> 19:21.100]  and here I have a probability of 0.66. And afterwards, after I've included all my test
[19:21.100 --> 19:27.060]  set, this is a probability of like 0.4. And this, you know, this is a probability of 0.6,
[19:27.060 --> 19:31.980]  and this is a probability of 0.4. And those are very different. And I can plug those into the
[19:31.980 --> 19:36.900]  KL divergence thing and get a positive number for the KL divergence at this node. And then I can do
[19:36.900 --> 19:42.220]  it for this node, and this node, and this node, and get all the KL divergences and add them up.
[19:42.220 --> 19:46.420]  And then I can get the total KL divergence with respect to my original tree
[19:46.940 --> 19:50.860]  of the posterior distribution with respect to the prior distribution,
[19:52.560 --> 20:00.520]  which is really useful. And so this basically, yeah, does it. And the code for doing this
[20:01.120 --> 20:04.960]  enables you to do it very quickly, because it's basically just counting
[20:06.240 --> 20:15.680]  a couple logs and some other statistical equations. And the slide just covers how you do it.
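A rough sketch of the update-and-sum procedure described above, under stated assumptions: the prior is stored as child counts per node, each test point's path adds one count per node it passes through, and the per-node KL divergence is taken between the normalized posterior and prior counts. The talk does not spell out the exact Bayesian formula, so the class `DriftTracker`, its methods, and this particular choice of KL are illustrative, not the library's code.

```python
import math
from collections import defaultdict

class DriftTracker:
    def __init__(self, prior_counts):
        # prior_counts: {parent: {child: training points under that child}}
        self.prior = prior_counts
        self.observed = defaultdict(lambda: defaultdict(int))

    def add_path(self, path):
        """Record one test point; `path` lists the nodes covering it, root first."""
        for parent, child in zip(path, path[1:]):
            self.observed[parent][child] += 1

    def kl_divergence(self):
        """Sum over touched nodes of KL(posterior || prior) for that node's choice."""
        total = 0.0
        for parent, seen in self.observed.items():
            prior = self.prior[parent]
            prior_total = sum(prior.values())
            post_total = prior_total + sum(seen.values())
            for child, n in prior.items():
                p_post = (n + seen.get(child, 0)) / post_total
                p_prior = n / prior_total
                total += p_post * math.log(p_post / p_prior)
        return total

# One node with a 0.33 / 0.66 prior split, as in the example above.
tracker = DriftTracker({"root": {"left": 2, "right": 4}})
for child in ["left", "left", "right", "right"]:
    tracker.add_path(["root", child])

# The posterior split is now 0.4 / 0.6, so the divergence is small but positive.
print(round(tracker.kl_divergence(), 4))  # 0.0097
```

Adding a point touches one counter per level of its path, which is why the update stays cheap even when the tree is large.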
[20:16.260 --> 20:24.500]  So let's go back to that original test set. So I've got basically a blown up version of
[20:25.020 --> 20:30.880]  those slides of the Gaussian distributions where I had them, where a large number of points from my
[20:30.880 --> 20:36.280]  training distribution, which was a single Gaussian sampled at the center of the thing.
[20:36.280 --> 20:42.680]  And then I have a small number of points from a test distribution. And in this case, I have
[20:42.680 --> 20:48.300]  100,000 points from my training distribution and only 500 points from my test distribution.
[20:48.300 --> 20:56.620]  And these both have a covariance matrix of all ones. So it's actually quite, and it's
[20:56.620 --> 21:04.160]  20-dimensional. This is actually harder than most datasets. MNIST is about 10-dimensional,
[21:04.160 --> 21:07.140]  if you really get down to it. If you want to know what that is, you can
[21:07.600 --> 21:12.260]  contact me on Twitter and ask me why MNIST might be lower dimensional.
[21:15.140 --> 21:20.420]  And between the training distribution, I build the tree, fix it, and then I make a posterior
[21:20.420 --> 21:27.140]  distribution for a bunch of test distributions. And I take the KL divergence. And you can see
[21:27.140 --> 21:33.740]  it's a bit noisy. The reason being is it depends a lot on the exact position of your points. It's
[21:33.740 --> 21:37.940]  fairly sensitive to these. So this is one of the problems. And one of the things I'll talk about
[21:37.940 --> 21:43.880]  at the end is like, there are ways to normalize this and resolve this, because I've designed it
[21:43.880 --> 21:48.740]  for speed, not really for accuracy. And I'll get into why I did that afterwards, because honestly,
[21:48.740 --> 21:54.400]  I came to the drift stuff backwards. But you can see, the further you get, the further you
[21:54.400 --> 22:00.680]  separate the two distributions, the higher the number gets. So the sanity test, the unit test
[22:00.680 --> 22:07.160]  of "does this work", passes. And I'll have to compare this to other methods.
[22:07.860 --> 22:12.940]  This takes big O of, sorry, Wasserstein takes big O of N to the fourth.
[22:13.440 --> 22:18.180]  Wasserstein is the only other drift calculator that I really know
[22:18.740 --> 22:25.280]  works. And is model independent and doesn't have some funny business with
[22:25.280 --> 22:34.800]  building an estimated distribution initially. This takes k log n. Another big difference from
[22:34.800 --> 22:39.320]  Wasserstein is the traditional way of doing Wasserstein requires a relatively equal size
[22:39.320 --> 22:46.160]  test and training set. You can't have the test set be 0.5% of the training set.
[22:46.720 --> 22:53.340]  That would just not work so great on the traditional method I know of. And this one,
[22:53.340 --> 22:58.040]  honestly, it's online. You can do, the test set can be a fraction of the size of the training
[22:58.040 --> 23:06.140]  set. And because it's online, it can do it in real time. And this is stupidly fast. You can track
[23:06.140 --> 23:12.620]  for the Ember open source, Ember malware data set, you build a reasonable cover tree,
[23:12.620 --> 23:21.580]  you can track 16,000 new samples per second on my laptop. So that's fast enough that inference
[23:21.580 --> 23:26.140]  wouldn't be, this would not be the bottleneck. Inference would be the bottleneck if you would
[23:26.900 --> 23:32.640]  build a cloud distribution, a cloud model, and put this as a guard in front of that cloud model.
[23:33.820 --> 23:41.460]  So we built it so that it can go fast enough to defend. And now here is where the thing originally
[23:41.460 --> 23:48.720]  came from, and then we backtracked onto Drift. So here's the test set attack. So this was
[23:48.720 --> 23:57.120]  originally categorized by Ian Goodfellow, and some defenses have been proposed by Nicholas Carlini
[23:58.140 --> 24:07.120]  and others. So, but here is the gist of the attack. Normal users just query. They don't
[24:07.120 --> 24:10.840]  care about whether they got a false positive or a false negative, they just keep querying.
[24:11.820 --> 24:18.280]  And for us, we think of a normal user as being like an enterprise user, trying to figure out
[24:18.280 --> 24:24.800]  whether all the reports coming over the wire of new binaries are malicious or benign, or they'd
[24:24.800 --> 24:31.560]  be submitting every single binary that they see on their network to a service that will tell them
[24:31.560 --> 24:40.840]  whether it was malicious or benign for an AV product. But a malicious user would be submitting
[24:40.840 --> 24:47.520]  things with a specific goal in mind. They care deeply what the label
[24:47.520 --> 24:53.140]  is, and they care deeply about finding a false negative. They want to find a malicious sample
[24:53.140 --> 24:57.380]  we classify as benign. So what they'll do is they'll try a bunch of stuff until they find
[24:57.380 --> 25:03.600]  something that is misclassified as benign and is actually malicious and bypasses our model,
[25:03.600 --> 25:13.840]  and then they deploy it as much as possible. So if you do, if you apply this to,
[25:13.840 --> 25:20.420]  you know, the fast drift calculators to that, you can attach a little drift calculator to each user.
[25:20.560 --> 25:28.500]  And as samples come over the wire, you track it for that user. And if the user starts
[25:28.500 --> 25:32.860]  exploiting something, they're going to get a very boring distribution where it's just
[25:32.860 --> 25:41.520]  one point repeated over and over again. And you spike your distribution, your KL divergence
[25:41.520 --> 25:45.820]  spikes up and then becomes very normal because it's essentially you're tracing the same path
[25:45.820 --> 25:52.240]  over and over and over again. And you can see that there's an exploration phase where the
[25:52.240 --> 25:57.640]  malicious users try to look for a benign misclassified sample, and when they find one,
[25:57.640 --> 26:02.240]  there's an exploitation phase. And the KL divergence rises rapidly. You can see that
[26:02.240 --> 26:10.320]  this is in log scale. So this is literally hundreds of times more KL divergence than the
[26:10.320 --> 26:15.280]  benign samples. And here, it's the same thing. Hundreds of times more KL divergence than
[26:15.280 --> 26:24.220]  the benign and the attackers were unsuccessful. Because of this, you can build, you can probably
[26:24.220 --> 26:34.460]  build a very good online defense of your system against a test set attack. And here's where basically
[26:34.460 --> 26:41.880]  the system was originally built sort of with this and other attacks in mind, because
[26:41.880 --> 26:48.440]  I'm more interested in defending my models than calculating drift. But because it's so
[26:48.440 --> 26:53.220]  bloody fast, it does this really well. And because it's so bloody fast, you can retrofit it.
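The idea of attaching a little drift calculator to each user could look something like the sketch below. It assumes each query has already been routed through the tree to give a path of node names, keeps a sliding window of recent paths per user, and scores the window with the same count-based KL divergence against the training prior. `PRIOR`, `UserDrift`, the window size, and the toy traffic are all hypothetical, chosen only to show the effect.

```python
import math
from collections import Counter, defaultdict, deque

# Hypothetical training populations per node: {parent: {child: count}}.
PRIOR = {"root": {"A": 660, "B": 330}, "A": {"A1": 66, "A2": 594}, "B": {"B1": 330}}

class UserDrift:
    def __init__(self, prior, window=50):
        self.prior = prior
        self.recent = deque(maxlen=window)   # only the last `window` queries count

    def observe(self, path):
        self.recent.append(tuple(path))
        return self.score()

    def score(self):
        seen = defaultdict(Counter)
        for path in self.recent:
            for parent, child in zip(path, path[1:]):
                seen[parent][child] += 1
        total = 0.0
        for parent, counts in seen.items():
            prior = self.prior[parent]
            prior_total = sum(prior.values())
            post_total = prior_total + sum(counts.values())
            for child, n in prior.items():
                p_post = (n + counts[child]) / post_total
                total += p_post * math.log(p_post * prior_total / n)
        return total

trackers = defaultdict(lambda: UserDrift(PRIOR))   # one tracker per user id

# A normal user queries roughly in proportion to the training data.
normal_traffic = ([["root", "A", "A1"]] * 3 + [["root", "A", "A2"]] * 30
                  + [["root", "B", "B1"]] * 17)
for path in normal_traffic:
    normal_score = trackers["normal"].observe(path)

# An attacker in the exploitation phase repeats one rare path over and over.
for _ in range(50):
    attack_score = trackers["attacker"].observe(["root", "A", "A1"])

print(f"normal: {normal_score:.6f}  attacker: {attack_score:.6f}")
```

In this toy setup the repeated path pushes the attacker's score several orders of magnitude above the normal user's, which is the kind of gap a threshold or alert could key on.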
[26:53.220 --> 27:01.080]  So here's where I'll talk about some next steps. So as I mentioned previously,
[27:01.080 --> 27:07.000]  this thing is a bit noisy. There are ways to fix it. In this case,
[27:07.000 --> 27:13.700]  window sizes and some tuning things that I didn't go into in these slides because I've
[27:13.700 --> 27:19.860]  already got 20 minutes. There are some tuning systems you can use to really clean up and
[27:20.100 --> 27:24.900]  de-noise. There's also some interesting weighting functions you can do inside the tree because
[27:25.500 --> 27:36.900]  there's some noise from an overactive leaf node might end up just building a very high KL
[27:36.900 --> 27:41.380]  divergence for just that one node and it throws everything off. And there's a couple other little
[27:41.380 --> 27:45.360]  things like that that still need to be cleaned up before this goes into production. But the
[27:45.360 --> 27:51.560]  system itself is looking very promising and is honestly the fastest of its kind
[27:52.320 --> 28:06.220]  that I can see in the world. And that's basically it for Goko's ability to detect test set attacks
[28:06.220 --> 28:13.480]  and other drift things. The library is at elastic.com, oh sorry, github slash elastic
[28:14.060 --> 28:21.040]  slash Goko. It's named after my grandmother because the base algorithm for this thing is
[28:21.040 --> 28:24.820]  called GMRA. And if you say that while drunk, you can kind of get grandma.
[28:25.160 --> 28:30.940]  And grandma's a bit too on the nose, so I named it Goko.
[28:32.680 --> 28:35.420]  Anyway, well, looking forward to hearing from you guys on Twitter
[28:35.420 --> 28:40.980]  and Twitch and Slack or whatever. So thank you very much and good night.
